knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
select * from book
| book_id | title | isbn13 | language_id | num_pages | publication_date | publisher_id |
|---|---|---|---|---|---|---|
| 1 | The World’s First Love: Mary Mother of God | 8987059752 | 2 | 276 | 1996-09-01 | 1010 |
| 2 | The Illuminati | 20049130001 | 1 | 352 | 2004-10-04 | 1967 |
| 3 | The Servant Leader | 23755004321 | 1 | 128 | 2003-03-11 | 1967 |
| 4 | What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 | 34406054602 | 1 | 168 | 1999-09-01 | 1978 |
| 5 | Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs | 49086007763 | 1 | 80 | 1983-12-29 | 416 |
| 6 | Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II | 73999140774 | 1 | 298 | 2000-04-01 | 96 |
| 7 | William Goldman: Four Screenplays | 73999254907 | 2 | 504 | 2000-05-01 | 95 |
| 8 | The Season: A Candid Look at Broadway | 73999768442 | 1 | 448 | 2004-07-01 | 1136 |
| 9 | The Beatles Complete - Updated Edition | 73999960822 | 1 | 303 | 1986-12-01 | 835 |
| 10 | Working Effectively with Legacy Code | 76092025986 | 1 | 464 | 2004-09-01 | 1591 |
select * from book_language
| language_id | language_code | language_name |
|---|---|---|
| 1 | eng | English |
| 2 | en-US | United States English |
| 3 | fre | French |
| 4 | spa | Spanish |
| 5 | en-GB | British English |
| 6 | mul | Multiple Languages |
| 7 | grc | Greek |
| 8 | enm | Middle English |
| 9 | en-CA | Canadian English |
| 10 | ger | German |
This section and the following take inspiration but not much content, from the LinkedIn learning course “Intermediate SQL for Data Scientists” - https://www.linkedin.com/learning/intermediate-sql-for-data-scientists
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
Indexes in SQLite are structures designed to improve the speed of data retrieval. They’re similar to indexes in a book, allowing for faster access to rows in a table based on the values of one or more columns.
Types of Indexes -
Single-Column Index - Created on a single column.
Multi-Column Index - Created on two or more columns.
Unique Index - Ensures that all values in the indexed column(s) are unique.
Primary Key Index - Automatically created when a table has a primary key constraint.
Automatic Indexes - Sometimes SQLite creates indexes automatically to optimize query performance, particularly for joins.
CREATE INDEX Customer_IDs ON cust_order(customer_id);
CREATE INDEX Customer_Orders ON cust_order(order_id, customer_id);
CREATE UNIQUE INDEX unique_countries ON country(country_name);
This SQL code above fails - A unique index does not discard non-unique values or automatically filter out duplicates. Instead, it enforces a constraint that prevents duplicates from being inserted into the table in the first place.
CREATE UNIQUE INDEX unique_status_ids ON order_status(status_id);
List all indexes associated with a table -
PRAGMA index_list(cust_order);
| seq | name | unique | origin | partial |
|---|---|---|---|---|
| 0 | Customer_Orders | 0 | c | 0 |
| 1 | Customer_IDs | 0 | c | 0 |
Get information about a specific index -
PRAGMA index_info(Customer_Orders);
| seqno | cid | name |
|---|---|---|
| 0 | 0 | order_id |
| 1 | 2 | customer_id |
Indexes can be dropped using the DROP INDEX statement -
DROP INDEX Customer_Orders;
Checking the index deletion has worked -
PRAGMA index_list(cust_order);
| seq | name | unique | origin | partial |
|---|---|---|---|---|
| 0 | Customer_IDs | 0 | c | 0 |
An example with a more complex and intricate query -
select * from book
| book_id | title | isbn13 | language_id | num_pages | publication_date | publisher_id |
|---|---|---|---|---|---|---|
| 1 | The World’s First Love: Mary Mother of God | 8987059752 | 2 | 276 | 1996-09-01 | 1010 |
| 2 | The Illuminati | 20049130001 | 1 | 352 | 2004-10-04 | 1967 |
| 3 | The Servant Leader | 23755004321 | 1 | 128 | 2003-03-11 | 1967 |
| 4 | What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 | 34406054602 | 1 | 168 | 1999-09-01 | 1978 |
| 5 | Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs | 49086007763 | 1 | 80 | 1983-12-29 | 416 |
| 6 | Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II | 73999140774 | 1 | 298 | 2000-04-01 | 96 |
| 7 | William Goldman: Four Screenplays | 73999254907 | 2 | 504 | 2000-05-01 | 95 |
| 8 | The Season: A Candid Look at Broadway | 73999768442 | 1 | 448 | 2004-07-01 | 1136 |
| 9 | The Beatles Complete - Updated Edition | 73999960822 | 1 | 303 | 1986-12-01 | 835 |
| 10 | Working Effectively with Legacy Code | 76092025986 | 1 | 464 | 2004-09-01 | 1591 |
select author_name from author where author_name == "Walter Scott"
| author_name |
|---|
| Walter Scott |
Creating an index of author names.
CREATE INDEX Author_names ON author(author_name);
Now our ‘server’ should use that index to optimise the performance of this query.
SELECT
b.title,
b.isbn13,
b.num_pages,
b.publication_date,
a.author_name,
ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('J.K. Rowling', 'Bill Bryson')
AND strftime('%Y', b.publication_date) IN ('1997','1998', '1999','2000')
ORDER BY ol.price DESC;
| title | isbn13 | num_pages | publication_date | author_name | price |
|---|---|---|---|---|---|
| Harry Potter and the Goblet of Fire (Harry Potter #4) | 9.780748e+12 | 636 | 2000-07-08 | J.K. Rowling | 15.04 |
| Bill Bryson: The Complete Notes | 9.780386e+12 | 544 | 2000-10-05 | Bill Bryson | 13.73 |
| The Lost Continent: Travels in Small-town America | 9.780553e+12 | 379 | 1999-01-02 | Bill Bryson | 12.07 |
| Harry Potter and the Sorcerer’s Stone (Harry Potter #1) | 9.780786e+12 | 424 | 1999-11-12 | J.K. Rowling | 9.99 |
| The Lost Continent: Travels in Small-town America | 9.780553e+12 | 379 | 1999-01-02 | Bill Bryson | 9.60 |
| I’m a Stranger Here Myself: Notes on Returning to America After Twenty Years Away | 9.780768e+12 | 304 | 2000-06-28 | Bill Bryson | 9.40 |
| Bill Bryson: The Complete Notes | 9.780386e+12 | 544 | 2000-10-05 | Bill Bryson | 2.71 |
| Notes from a Small Island | 9.780381e+12 | 324 | 1997-05-28 | Bill Bryson | 1.87 |
Views in SQLite are virtual tables that provide a way to represent the results of a query as a table.
Reasons to use Views -
Simplify Complex Queries - By encapsulating complex joins and calculations within a view, queries can be simpler and more understandable.
Enhance Security - Views can restrict access to specific data by exposing only certain columns or rows to users.
Provide Abstraction - Offer a layer of abstraction, allowing changes in the underlying database schema without affecting the end users.
Improve Maintainability - Views centralise query logic, making the system easier to maintain and modify.
CREATE VIEW Bryson_Books AS
SELECT
b.title,
b.isbn13,
b.num_pages,
b.publication_date,
a.author_name,
ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('Bill Bryson');
CREATE VIEW Rowling_Books AS
SELECT
b.title,
b.isbn13,
b.num_pages,
b.publication_date,
a.author_name,
ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('J.K. Rowling');
CREATE VIEW Walter_Scott_Books AS
SELECT
b.title,
b.isbn13,
b.num_pages,
b.publication_date,
a.author_name,
ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('Walter Scott');
SELECT * FROM Bryson_Books;
| title | isbn13 | num_pages | publication_date | author_name | price |
|---|---|---|---|---|---|
| Bizarre World | 9.780752e+12 | 120 | 2001-05-01 | Bill Bryson | 17.32 |
| The Lost Continent: Travels in Small Town America | 9.780061e+12 | 299 | 1990-08-28 | Bill Bryson | 14.36 |
| Neither Here nor There: Travels in Europe | 9.780381e+12 | 254 | 1993-03-28 | Bill Bryson | 12.47 |
| Made in America: An Informal History of the English Language in the United States | 9.780381e+12 | 364 | 2001-10-23 | Bill Bryson | 9.28 |
| Notes from a Small Island | 9.780381e+12 | 324 | 1997-05-28 | Bill Bryson | 1.87 |
| Bill Bryson: The Complete Notes | 9.780386e+12 | 544 | 2000-10-05 | Bill Bryson | 2.71 |
| Bill Bryson: The Complete Notes | 9.780386e+12 | 544 | 2000-10-05 | Bill Bryson | 13.73 |
| A Short History of Nearly Everything (Illustrated Edition) | 9.780386e+12 | 624 | 2010-10-05 | Bill Bryson | 5.64 |
| A Short History of Nearly Everything (Illustrated Edition) | 9.780386e+12 | 624 | 2010-10-05 | Bill Bryson | 19.05 |
| Down Under | 9.780553e+12 | 398 | 2001-08-06 | Bill Bryson | 10.92 |
SELECT * FROM Walter_Scott_Books;
| title | isbn13 | num_pages | publication_date | author_name | price |
|---|---|---|---|---|---|
| The Antiquary | 9.780193e+12 | 528 | 2002-05-23 | Walter Scott | 10.66 |
| The Antiquary | 9.780193e+12 | 528 | 2002-05-23 | Walter Scott | 15.75 |
| The Antiquary | 9.780193e+12 | 528 | 2002-05-23 | Walter Scott | 0.84 |
| The Antiquary | 9.780193e+12 | 528 | 2002-05-23 | Walter Scott | 13.99 |
| The Castle of Otranto | 9.780193e+12 | 125 | 1998-07-16 | Walter Scott | 3.56 |
| Waverley | 9.780193e+12 | 463 | 1998-08-20 | Walter Scott | 1.96 |
Views themselves are not directly updatable, but they can be dropped and recreated -
DROP VIEW IF EXISTS Bryson_Books;
SELECT name FROM sqlite_master WHERE type='view';
| name |
|---|
| Rowling_Books |
| Walter_Scott_Books |
SELECT sql FROM sqlite_master WHERE type='view' AND name='Walter_Scott_Books';
| sql |
|---|
| CREATE VIEW Walter_Scott_Books AS |
SELECT b.title, b.isbn13, b.num_pages, b.publication_date, a.author_name, ol.price FROM book b INNER JOIN book_author ba ON b.book_id = ba.book_id INNER JOIN author a ON ba.author_id = a.author_id INNER JOIN order_line ol ON b.book_id = ol.book_id INNER JOIN cust_order co ON ol.order_id = co.order_id WHERE a.author_name IN (‘Walter Scott’) |
The SUM() function is an aggregate function that calculates the total sum of a numeric column. The function is commonly used in conjunction with the GROUP BY clause to calculate sums for specific groups of data.
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
SELECT SUM(street_name) as "No.different Street Names", SUM(city) as "No.different Cities" FROM address;
| No.different Street Names | No.different Cities |
|---|---|
| 87 | 0 |
As street_name and city are text columns, sum won’t work properly on them. It’s best to use COUNT(DISTINCT()).
SELECT COUNT(DISTINCT street_name) as "No. of Different Street Names",
COUNT(DISTINCT city) as "No. of Different Cities"
FROM address;
| No. of Different Street Names | No. of Different Cities |
|---|---|
| 958 | 974 |
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Order per Customer" DESC;
| Sum of Orders per Customer | Full Name |
|---|---|
| 112.44 | Abbot Tesseyman |
| 31.84 | Abbott Kendrew |
| 119.68 | Abby Chevins |
| 59.10 | Abel Trower |
| 88.54 | Abigael Trowbridge |
| 177.89 | Abraham Skudder |
| 104.84 | Adah Cotty |
| 143.15 | Addie Basterfield |
| 47.75 | Addison Sigg |
| 161.51 | Adrian Kunzelmann |
The AVG() function in SQL is an aggregate function that calculates the average value of a numeric column. It sums up all the values in the column and divides by the number of non-null values, providing the mean value.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Order per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name |
|---|---|---|
| 112.44 | 11.244000 | Abbot Tesseyman |
| 31.84 | 10.613333 | Abbott Kendrew |
| 119.68 | 13.297778 | Abby Chevins |
| 59.10 | 7.387500 | Abel Trower |
| 88.54 | 8.854000 | Abigael Trowbridge |
| 177.89 | 9.362632 | Abraham Skudder |
| 104.84 | 7.488571 | Adah Cotty |
| 143.15 | 10.225000 | Addie Basterfield |
| 47.75 | 9.550000 | Addison Sigg |
| 161.51 | 9.500588 | Adrian Kunzelmann |
The ROUND() function in SQL is used to round a numeric value to a specified number of decimal places. It takes two arguments: the number to be rounded and the number of decimal places to round to.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
ROUND(AVG(ol.price),2) AS "Average Order Price per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Order per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name |
|---|---|---|
| 112.44 | 11.24 | Abbot Tesseyman |
| 31.84 | 10.61 | Abbott Kendrew |
| 119.68 | 13.30 | Abby Chevins |
| 59.10 | 7.39 | Abel Trower |
| 88.54 | 8.85 | Abigael Trowbridge |
| 177.89 | 9.36 | Abraham Skudder |
| 104.84 | 7.49 | Adah Cotty |
| 143.15 | 10.22 | Addie Basterfield |
| 47.75 | 9.55 | Addison Sigg |
| 161.51 | 9.50 | Adrian Kunzelmann |
The VARIANCE() function is used to calculate the statistical variance of a set of numeric values. The variance defines a measure of the spread dispersion within a set of data.
However, SQLite does not have a built-in VARIANCE() function.
To calculate variance in SQLite, one can use a combination of SQL functions.
The sample variance estimates the variance from a sample of the population. The formula is -
\[ \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
\[\sigma^2\] is the population variance
\[x_i\] represents each data point
\[\bar{x}\] is the sample mean
\[n - 1\] is the total number of data points in the population
Used when there’s only a sample and need the population variance needs to be estimated. The denominator is \[n - 1\] . \[n - 1\] instead of \[n\] to correct the bias in the estimation of the population variance from a sample (Bessel’s correction).
The SQL code -
SELECT SUM((value - avg_value) * (value - avg_value)) / (COUNT(*) - 1) AS sample_variance FROM ( SELECT value, AVG(value) AS avg_value FROM table_name ) AS subquery;
WITH OrderStats AS (
SELECT
c.customer_id,
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
COUNT(ol.price) AS "Order Count",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.customer_id, c.first_name, c.last_name
)
SELECT
os."Sum of Orders per Customer",
ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
os."Full Name",
ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1), 2) AS "Order Price Sample Variance"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
OrderStats os ON c.customer_id = os.customer_id
GROUP BY
os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name | Order Price Sample Variance |
|---|---|---|---|
| 711.41 | 10.94 | Rich Kirsz | 32.82 |
| 663.41 | 10.21 | Farand Tremmil | 34.46 |
| 635.58 | 10.42 | Renado Sherrington | 34.50 |
| 622.31 | 12.20 | Lynda Roseborough | 26.15 |
| 611.59 | 10.54 | Mick Sever | 38.09 |
| 596.18 | 11.25 | Alysa Crombleholme | 29.67 |
| 583.81 | 9.73 | Emylee Hubbert | 37.33 |
| 574.20 | 10.07 | La verne Figg | 30.94 |
| 561.26 | 9.85 | Zora Hurles | 42.98 |
| 549.63 | 10.78 | Penny Bovingdon | 37.10 |
The population variance measures the dispersion of all data points in a population from the population mean \[\mu\]. The formula is -
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
\[\sigma^2\] is the population variance
\[x_i\] represents each data point
\[\mu\] is the mean of the population
\[n\] is the total number of data points in the population
With large datasets, the difference between sample and population variance is minimal since the correction factor (n - 1) vs. (n) has a smaller impact as the sample size increases.
The SQL code -
SELECT AVG((value - avg_value) * (value - avg_value)) AS population_variance FROM ( SELECT value, AVG(value) AS avg_value FROM table_name ) AS subquery;
WITH OrderStats AS (
SELECT
c.customer_id,
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
COUNT(ol.price) AS "Order Count",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.customer_id, c.first_name, c.last_name
)
SELECT
os."Sum of Orders per Customer",
ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
os."Full Name",
ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count", 2) AS "Order Price Population Variance"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
OrderStats os ON c.customer_id = os.customer_id
GROUP BY
os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name | Order Price Population Variance |
|---|---|---|---|
| 711.41 | 10.94 | Rich Kirsz | 32.31 |
| 663.41 | 10.21 | Farand Tremmil | 33.93 |
| 635.58 | 10.42 | Renado Sherrington | 33.94 |
| 622.31 | 12.20 | Lynda Roseborough | 25.64 |
| 611.59 | 10.54 | Mick Sever | 37.44 |
| 596.18 | 11.25 | Alysa Crombleholme | 29.11 |
| 583.81 | 9.73 | Emylee Hubbert | 36.71 |
| 574.20 | 10.07 | La verne Figg | 30.40 |
| 561.26 | 9.85 | Zora Hurles | 42.23 |
| 549.63 | 10.78 | Penny Bovingdon | 36.37 |
Standard deviation measures the dispersion of a dataset relative to its mean, indicating how spread out the data points are. A low standard deviation means the data points are close to the mean, while a high standard deviation indicates they are more spread out.
SQLite does not have a built-in STDDEV() or STDEV() function for calculating standard deviation.
These are the steps to calculate standard deviation in SQLite:
Calculate the mean of the dataset.
Compute Squared Differences calculate the squared difference of each value from the mean.
Aggregate and Calculate - Sum up the squared differences, divide by the count of values (for population standard deviation) or by (n - 1) (for sample standard deviation), and take the square root of the result.
The population standard deviation (()) is calculated using the formula:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
where: - () is the population standard deviation, - (x_i) represents each data point, - () is the mean of the population, - (N) is the total number of data points in the population.
The SQL code -
SELECT SQRT(AVG((value - avg_value) * (value - avg_value))) AS population_stddev FROM ( SELECT value, AVG(value) OVER () AS avg_value FROM table_name ) AS subquery;
WITH OrderStats AS (
SELECT
c.customer_id,
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
COUNT(ol.price) AS "Order Count",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.customer_id, c.first_name, c.last_name
)
SELECT
os."Sum of Orders per Customer",
ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
os."Full Name",
ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count", 2) AS "Order Price Population Variance",
ROUND(SQRT(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count"), 2) AS "Order Price Standard Deviation"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
OrderStats os ON c.customer_id = os.customer_id
GROUP BY
os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name | Order Price Population Variance | Order Price Standard Deviation |
|---|---|---|---|---|
| 711.41 | 10.94 | Rich Kirsz | 32.31 | 5.68 |
| 663.41 | 10.21 | Farand Tremmil | 33.93 | 5.82 |
| 635.58 | 10.42 | Renado Sherrington | 33.94 | 5.83 |
| 622.31 | 12.20 | Lynda Roseborough | 25.64 | 5.06 |
| 611.59 | 10.54 | Mick Sever | 37.44 | 6.12 |
| 596.18 | 11.25 | Alysa Crombleholme | 29.11 | 5.39 |
| 583.81 | 9.73 | Emylee Hubbert | 36.71 | 6.06 |
| 574.20 | 10.07 | La verne Figg | 30.40 | 5.51 |
| 561.26 | 9.85 | Zora Hurles | 42.23 | 6.50 |
| 549.63 | 10.78 | Penny Bovingdon | 36.37 | 6.03 |
The sample standard deviation ((s)) is calculated using the formula:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
where: - (s) is the sample standard deviation, - (x_i) represents each data point in the sample, - ({x}) is the sample mean, - (n) is the number of data points in the sample.
The SQL code -
SELECT SQRT(SUM((value - avg_value) * (value - avg_value)) / (COUNT(*) - 1)) AS sample_stddev FROM ( SELECT value, AVG(value) OVER () AS avg_value FROM table_name ) AS subquery;
WITH SampledCustomers AS (
SELECT
c.customer_id,
c.first_name,
c.last_name
FROM
customer c
ORDER BY
RANDOM()
LIMIT 50
),
OrderStats AS (
SELECT
c.customer_id,
SUM(ol.price) AS "Sum of Orders per Customer",
AVG(ol.price) AS "Average Order Price per Customer",
COUNT(ol.price) AS "Order Count",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
SampledCustomers c ON co.customer_id = c.customer_id
GROUP BY
c.customer_id, c.first_name, c.last_name
)
SELECT
os."Sum of Orders per Customer",
ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
os."Full Name",
ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1), 2) AS "Order Price Sample Variance",
ROUND(SQRT(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1)), 2) AS "Order Price Sample Standard Deviation"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
OrderStats os ON c.customer_id = os.customer_id
GROUP BY
os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Average Order Price per Customer | Full Name | Order Price Sample Variance | Order Price Sample Standard Deviation |
|---|---|---|---|---|
| 583.81 | 9.73 | Emylee Hubbert | 37.33 | 6.11 |
| 549.63 | 10.78 | Penny Bovingdon | 37.10 | 6.09 |
| 462.80 | 10.06 | Murray Senussi | 28.28 | 5.32 |
| 354.02 | 9.08 | Shae Leser | 33.24 | 5.77 |
| 264.18 | 10.16 | Ronnie Thaim | 27.34 | 5.23 |
| 247.00 | 11.23 | Shayna Farbrother | 32.07 | 5.66 |
| 213.33 | 9.70 | Urbanus Renfield | 33.70 | 5.81 |
| 197.43 | 10.39 | Wain Michel | 34.13 | 5.84 |
| 153.72 | 10.25 | Amalita Digginson | 58.57 | 7.65 |
| 126.77 | 7.92 | Heinrik Sibylla | 21.70 | 4.66 |
WITH(), also known as Common Table Expressions (CTEs), allows for improved readability and reusability of SQL queries. It’s particularly useful for breaking down complex queries into simpler, more manageable parts by creating temporary result sets that can be referenced within the main query.
WITH cte_name AS ( – CTE Query SELECT … ) SELECT … FROM cte_name
Key Differences
Subqueries - Can be less readable, especially when nested.
WITH() - Provide a clearer, more structured approach by defining temporary tables with meaningful names.
Subqueries - Generally not reusable; you need to repeat the subquery if it’s used in multiple places.
WITH() - Reusable within the main query, reducing redundancy and improving maintainability.
Subqueries - Harder to debug and maintain due to their nested nature.
WITH() - Easier to debug and maintain due to their clear, modular structure.
When using WITH()
-- Step 1: Calculate the total amount spent by each customer
WITH CustomerSpending AS (
SELECT
co.customer_id,
CONCAT(c.first_name, ' ', c.last_name) AS full_name,
SUM(ol.price) AS total_spent
FROM
cust_order co
INNER JOIN
order_line ol ON co.order_id = ol.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
co.customer_id, full_name
)
-- Step 2: Select the top 5 customers and join with book details
SELECT
b.title,
ol.price,
cs.full_name,
cs.total_spent
FROM
CustomerSpending cs
INNER JOIN
cust_order co ON cs.customer_id = co.customer_id
INNER JOIN
order_line ol ON co.order_id = ol.order_id
INNER JOIN
book b ON ol.book_id = b.book_id
WHERE
cs.customer_id IN (
SELECT customer_id
FROM CustomerSpending
ORDER BY total_spent DESC
LIMIT 5
)
ORDER BY
cs.total_spent DESC;
| title | price | full_name | total_spent |
|---|---|---|---|
| The Wall of the Sky the Wall of the Eye | 1.15 | Rich Kirsz | 711.41 |
| The Call of Earth (Homecoming #2) | 11.20 | Rich Kirsz | 711.41 |
| Prisoner’s Dilemma | 7.33 | Rich Kirsz | 711.41 |
| Of Love and Shadows | 14.97 | Rich Kirsz | 711.41 |
| Berlin Blues | 9.92 | Rich Kirsz | 711.41 |
| Triptych (Will Trent #1) | 0.76 | Rich Kirsz | 711.41 |
| If Beale Street Could Talk | 11.44 | Rich Kirsz | 711.41 |
| No Country for Old Men | 8.26 | Rich Kirsz | 711.41 |
| The Blood of the Moon: Understanding the Historic Struggle Between Islam and Western Civilization | 13.00 | Rich Kirsz | 711.41 |
| À tout jamais | 14.56 | Rich Kirsz | 711.41 |
The part with the WITH inner query -
This CTE calculates the total amount spent on each order for every customer.
It joins the cust_order,order_line, and customers tables and groups the data by customer ID and full name to compute the total order amount.
The rest, the outer query -
The main query selects from the CustomerSpending CTE.
It also gets the book title and price from the book and order line tables respectively, by INNER JOINs.
A WHERE clause specifies the results should only contains the top 5 spending customers in CustomerSpending.
The results are ordered by the top spenders, descending.
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
The LIKE operator is used for pattern matching within text fields. It allows the use of % to represent zero or more characters and _ to represent a single character. For example, the query SELECT * FROM customers WHERE name LIKE '%enko'; will return all rows where the name column ends with “enko”.
Getting all surnames starting with ‘Mac’ -
select * from customer where last_name like 'Mac%'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 114 | Aurelie | MacSherry | amacsherry35@is.gd |
| 211 | Desiri | MacDunleavy | dmacdunleavy5u@behance.net |
| 299 | Yvonne | Maccree | ymaccree8a@mac.com |
| 416 | Zaria | MacCafferky | zmaccafferkybj@trellian.com |
| 481 | Kaleb | MacClenan | kmacclenandc@ocn.ne.jp |
| 500 | Aurelia | MacGowan | amacgowandv@opera.com |
| 604 | Agace | MacKinnon | amackinnongr@live.com |
| 951 | Wilma | MacGorrie | wmacgorrieqe@nba.com |
| 979 | Danell | Mace | dmacer6@clickbank.net |
| 994 | Archibold | MacNab | amacnabrl@nih.gov |
Getting all surnames starting with ‘Mc’ -
select * from customer where last_name like 'Mc%'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 73 | Ruthi | McGeever | rmcgeever20@sakura.ne.jp |
| 168 | Lynsey | McPeice | lmcpeice4n@technorati.com |
| 187 | Miner | McLay | mmclay56@plala.or.jp |
| 199 | Bjorn | McCloud | bmccloud5i@aboutads.info |
| 317 | Stesha | McAlees | smcalees8s@wikispaces.com |
| 319 | Win | McManamon | wmcmanamon8u@ovh.net |
| 347 | Ephrem | Mc Ilwrick | emcilwrick9m@wikimedia.org |
| 410 | Rriocard | McPhail | rmcphailbd@huffingtonpost.com |
| 539 | Jane | McCreath | jmccreathey@samsung.com |
| 772 | Tybalt | McOwen | tmcowenlf@storify.com |
Getting all surnames ending with ‘vich’ -
select * from customer where last_name like '%vich'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 578 | Lynea | Matskevich | lmatskevichg1@japanpost.jp |
| 1545 | Deana | Matusevich | dmatusevichf4@multiply.com |
Getting all surnames ending with ‘enko’ -
select * from customer where last_name like '%enko'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 174 | Zachery | Hriinchenko | zhriinchenko4t@ocn.ne.jp |
| 1062 | Harold | Izacenko | hizacenko1p@pen.io |
Getting all surnames starting or ending with ‘man’ (lower or upper case) -
select * from customer where last_name like '%man%'
| customer_id | first_name | last_name | |
|---|---|---|---|
| 218 | Abbot | Tesseyman | atesseyman61@hc360.com |
| 221 | Parker | Strangman | pstrangman64@barnesandnoble.com |
| 319 | Win | McManamon | wmcmanamon8u@ovh.net |
| 360 | Dulci | Portman | dportman9z@phoca.cz |
| 366 | Giulia | Borrowman | gborrowmana5@github.com |
| 402 | Charlean | Palphramand | cpalphramandb5@amazonaws.com |
| 413 | Sigfried | Mansel | smanselbg@about.com |
| 430 | Adrian | Kunzelmann | akunzelmannbx@state.gov |
| 483 | Antonie | Liebmann | aliebmannde@blogs.com |
| 501 | Salome | Elliman | sellimandw@1688.com |
Making a new column specifically based on the like condition -
SELECT
last_name as "Customer Surname",
CASE
WHEN last_name LIKE '%enko'
OR last_name LIKE '%vich'
OR last_name LIKE '%vych'
OR last_name LIKE '%chuk'
OR last_name LIKE '%chyk'
OR last_name LIKE '%ski'
OR last_name LIKE '%sky'
OR last_name LIKE '%uk'
OR last_name LIKE '%ko'
OR last_name LIKE '%yshyn'
OR last_name LIKE '%iv' THEN 'Yes'
ELSE 'No'
END AS 'Potentially Ukrainian Surname?'
FROM customer
WHERE last_name LIKE '%enko'
OR last_name LIKE '%vich'
OR last_name LIKE '%vych'
OR last_name LIKE '%chuk'
OR last_name LIKE '%chyk'
OR last_name LIKE '%ski'
OR last_name LIKE '%sky'
OR last_name LIKE '%uk'
OR last_name LIKE '%ko'
OR last_name LIKE '%yshyn'
OR last_name LIKE '%iv';
| Customer Surname | Potentially Ukrainian Surname? |
|---|---|
| Hriinchenko | Yes |
| Davidofski | Yes |
| Bagniuk | Yes |
| Georgievski | Yes |
| Matskevich | Yes |
| Malinowski | Yes |
| Matschuk | Yes |
| Izacenko | Yes |
| Dmiterko | Yes |
| Matusevich | Yes |
Unfortunately, SQLite that’s being used here does not support the command ‘similar to’ that could be used in PostGreSQL.
If PostGreSQL was being used, the above query could be shortened a lot -
SELECT last_name, CASE WHEN last_name SIMILAR TO ‘%(enko|vich|vych|chuk|chyk|ski|sky|uk|ko|yshyn|iv)’ THEN ‘YES’ ELSE ‘NO’ END AS “Ukrainian Surname” FROM customer WHERE last_name SIMILAR TO ‘%(enko|vich|vych|chuk|chyk|ski|sky|uk|ko|yshyn|iv)’;
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
SOUNDEX is a phonetic algorithm that indexes words by their sound when pronounced in English. This can be useful for matching words that sound similar but are spelled differently. SQLite has this function.
SELECT SOUNDEX('George');
| SOUNDEX(‘George’) |
|---|
| G620 |
SELECT last_name AS 'Surnames'
FROM customer
WHERE SOUNDEX(last_name) = SOUNDEX('Smyth');
| Surnames |
|---|
| Smoth |
| Sineath |
select soundex('Postgres'), soundex('Postgresss'), ('Postgres' = 'Postgresss'),
soundex('Postgres') = soundex('Postgresss')
| soundex(‘Postgres’) | soundex(‘Postgresss’) | (‘Postgres’ = ‘Postgresss’) | soundex(‘Postgres’) = soundex(‘Postgresss’) |
|---|---|---|---|
| P232 | P232 | 0 | 1 |
A SOUNDEX code consists of a letter followed by three digits, representing the phonetic pattern of the word. Here’s how it is constructed:
First Letter: The first letter of the word is kept as the first letter of the SOUNDEX code.
Digits: The remaining letters are converted to digits based on their phonetic sound:
B, F, P, V → 1
C, G, J, K, Q, S, X, Z → 2
D, T → 3
L → 4
M, N → 5
R → 6
Similar Sounds: Adjacent letters that represent the same sound are collapsed into a single digit.
Vowels and Certain Letters: A, E, I, O, U, H, W, and Y are ignored unless they are the first letter.
Truncation/Zero Padding: The code is truncated to four characters if necessary, or zero-padded to ensure it is four characters long.
For example, the SOUNDEX code “P232” is generated from the word “Postgres” as follows:
‘P’ is the first letter.
‘o’ is ignored.
‘s’ maps to 2.
‘t’ maps to 3.
‘g’ maps to 2.
Remaining letters (‘r’, ‘e’, ‘s’) are either ignored or do not change the pattern as the code is already four characters long.
Thus, “Postgres” becomes “P232”.
select difference ('Postgres', 'Postgresss') as "Difference between the strings Postgres and Postgresss"
| Difference between the strings Postgres and Postgresss |
|---|
| 4 |
The DIFFERENCE() function in SQL compares the SOUNDEX values of two strings and returns an integer value between 0 and 4, indicating the degree of similarity between the two strings. A result of 4 means the strings sound very similar, while a result of 0 means they sound very different. This function is particularly useful for fuzzy matching in text searches.
The levenshtein() function calculates the Levenshtein distance between two strings, which is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. It is commonly used to measure the similarity between two strings, with a lower distance indicating greater similarity.
SQLite doesn’t have it unfortunately!
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Full Name |
|---|---|
| 711.41 | Rich Kirsz |
| 663.41 | Farand Tremmil |
| 635.58 | Renado Sherrington |
| 622.31 | Lynda Roseborough |
| 611.59 | Mick Sever |
| 596.18 | Alysa Crombleholme |
| 583.81 | Emylee Hubbert |
| 574.20 | La verne Figg |
| 561.26 | Zora Hurles |
| 549.63 | Penny Bovingdon |
The HAVING clause is used to specify a condition for groups of rows created by the GROUP BY clause, similar to how the WHERE clause is used to specify a condition for individual rows.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
HAVING
"Sum of Orders per Customer" > 600
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Full Name |
|---|---|
| 711.41 | Rich Kirsz |
| 663.41 | Farand Tremmil |
| 635.58 | Renado Sherrington |
| 622.31 | Lynda Roseborough |
| 611.59 | Mick Sever |
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")
Window functions in SQL perform calculations across a set of table rows that are somehow related to the current row. Unlike aggregate functions, which return a single value for a group of rows, window functions can return multiple rows for each row in the query, preserving the row’s identity. These functions are often used for running totals, moving averages, and other cumulative calculations.
ROW_NUMBER() - Assigns a unique sequential integer to rows within a partition of a result set.
Assigns a unique sequential integer to each row, ordered by “Sum of Orders per Customer” in descending order, useful for ranking albeit with ties (there can’t be 2 third placed rows for example).
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
ROW_NUMBER() OVER (ORDER BY SUM(ol.price) DESC) AS "Row Number"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Full Name | Row Number |
|---|---|---|
| 711.41 | Rich Kirsz | 1 |
| 663.41 | Farand Tremmil | 2 |
| 635.58 | Renado Sherrington | 3 |
| 622.31 | Lynda Roseborough | 4 |
| 611.59 | Mick Sever | 5 |
| 596.18 | Alysa Crombleholme | 6 |
| 583.81 | Emylee Hubbert | 7 |
| 574.20 | La verne Figg | 8 |
| 561.26 | Zora Hurles | 9 |
| 549.63 | Penny Bovingdon | 10 |
Assigns a rank to each row, with gaps in the ranking sequence for ties.
Assigns a rank to each customer based on their “Sum of Orders per Customer”, allowing for ties where customers with the same sum receive the same rank. No 89th position just two 88 positions.
WITH RankedCustomers AS (
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
RANK() OVER (ORDER BY SUM(ol.price) DESC) AS "Rank"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
)
SELECT *
FROM RankedCustomers
WHERE "Rank" BETWEEN 85 AND 95
ORDER BY "Rank";
| Sum of Orders per Customer | Full Name | Rank |
|---|---|---|
| 337.07 | Krishnah Traite | 85 |
| 336.50 | Cecily Norley | 86 |
| 336.23 | Emory Bagniuk | 87 |
| 334.23 | Francis Cutmore | 88 |
| 334.23 | Ruthanne Vatini | 88 |
| 331.91 | Leone Utterson | 90 |
| 330.95 | Nicolina Scorthorne | 91 |
| 329.13 | Vladamir Ellerey | 92 |
| 328.39 | Berti Gowlett | 93 |
| 328.25 | Daisey Lamball | 94 |
DENSE_RANK() does not leave gaps and provides consecutive ranking numbers regardless of the number of ties whereas RANK() leaves gaps in the ranking sequence if there are ties. An 89th position as well as two 88 positions.
WITH RankedCustomers AS (
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
DENSE_RANK() OVER (ORDER BY SUM(ol.price) DESC) AS "Dense Rank"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
)
SELECT *
FROM RankedCustomers
WHERE "Dense Rank" BETWEEN 85 AND 95
ORDER BY "Dense Rank";
| Sum of Orders per Customer | Full Name | Dense Rank |
|---|---|---|
| 337.07 | Krishnah Traite | 85 |
| 336.50 | Cecily Norley | 86 |
| 336.23 | Emory Bagniuk | 87 |
| 334.23 | Francis Cutmore | 88 |
| 334.23 | Ruthanne Vatini | 88 |
| 331.91 | Leone Utterson | 89 |
| 330.95 | Nicolina Scorthorne | 90 |
| 329.13 | Vladamir Ellerey | 91 |
| 328.39 | Berti Gowlett | 92 |
| 328.25 | Daisey Lamball | 93 |
Distributes the rows into a specified number of buckets (here, deciles), providing a way to understand distribution across different segments.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
NTILE(10) OVER (ORDER BY SUM(ol.price) DESC) AS "Decile"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC;
| Sum of Orders per Customer | Full Name | Decile |
|---|---|---|
| 711.41 | Rich Kirsz | 1 |
| 663.41 | Farand Tremmil | 1 |
| 635.58 | Renado Sherrington | 1 |
| 622.31 | Lynda Roseborough | 1 |
| 611.59 | Mick Sever | 1 |
| 596.18 | Alysa Crombleholme | 1 |
| 583.81 | Emylee Hubbert | 1 |
| 574.20 | La verne Figg | 1 |
| 561.26 | Zora Hurles | 1 |
| 549.63 | Penny Bovingdon | 1 |
Provides access to the next value in the query result, useful for comparing each customer’s sum with the next customer’s sum.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
LEAD(SUM(ol.price)) OVER (ORDER BY SUM(ol.price) DESC) AS "Next Customer's Sum"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC
LIMIT 5;
| Sum of Orders per Customer | Full Name | Next Customer’s Sum |
|---|---|---|
| 711.41 | Rich Kirsz | 663.41 |
| 663.41 | Farand Tremmil | 635.58 |
| 635.58 | Renado Sherrington | 622.31 |
| 622.31 | Lynda Roseborough | 611.59 |
| 611.59 | Mick Sever | 596.18 |
The converse of LEAD(). LAG() provides access to the previous value in the query result, useful for comparing each customer’s sum with the next customer’s sum.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
LAG(SUM(ol.price)) OVER (ORDER BY SUM(ol.price) DESC) AS "Previous Customer's Sum"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC
LIMIT 5;
| Sum of Orders per Customer | Full Name | Previous Customer’s Sum |
|---|---|---|
| 711.41 | Rich Kirsz | NA |
| 663.41 | Farand Tremmil | 711.41 |
| 635.58 | Renado Sherrington | 663.41 |
| 622.31 | Lynda Roseborough | 635.58 |
| 611.59 | Mick Sever | 622.31 |
Returns the sum of orders for the top customer in the ordered list as a seperate column, giving a reference point for the highest sum of orders.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
FIRST_VALUE(SUM(ol.price)) OVER (ORDER BY SUM(ol.price) DESC) AS "First Customer Sum"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC
LIMIT 5;
| Sum of Orders per Customer | Full Name | First Customer Sum |
|---|---|---|
| 711.41 | Rich Kirsz | 711.41 |
| 663.41 | Farand Tremmil | 711.41 |
| 635.58 | Renado Sherrington | 711.41 |
| 622.31 | Lynda Roseborough | 711.41 |
| 611.59 | Mick Sever | 711.41 |
Returns the sum of orders for the bottom customer in the ordered list as a seperate column, giving a reference point for the lowest sum of orders.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
LAST_VALUE(SUM(ol.price)) OVER (ORDER BY SUM(ol.price) DESC) AS "Last Customer Sum"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" ASC
LIMIT 5;
| Sum of Orders per Customer | Full Name | Last Customer Sum |
|---|---|---|
| 0.25 | Frans Teague | 0.25 |
| 0.29 | Robinson Renfrew | 0.29 |
| 0.62 | Cordelie Rickertsen | 0.62 |
| 0.72 | Kasey Jillitt | 0.72 |
| 1.32 | Alfons Blabie | 1.32 |
SUM() with OVER() calculates the total sum of all orders across all customers, providing a grand total within the context of each row.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
SUM(SUM(ol.price)) OVER () AS "Total Sum of All Customers"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC
LIMIT 5;
| Sum of Orders per Customer | Full Name | Total Sum of All Customers |
|---|---|---|
| 711.41 | Rich Kirsz | 152763 |
| 663.41 | Farand Tremmil | 152763 |
| 635.58 | Renado Sherrington | 152763 |
| 622.31 | Lynda Roseborough | 152763 |
| 611.59 | Mick Sever | 152763 |
AVG() with OVER() calculates the total sum of all orders across all customers, providing a grand total within the context of each row.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
AVG(SUM(ol.price)) OVER () AS "Average Order Sum"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC
LIMIT 5;
| Sum of Orders per Customer | Full Name | Average Order Sum |
|---|---|---|
| 711.41 | Rich Kirsz | 115.8173 |
| 663.41 | Farand Tremmil | 115.8173 |
| 635.58 | Renado Sherrington | 115.8173 |
| 622.31 | Lynda Roseborough | 115.8173 |
| 611.59 | Mick Sever | 115.8173 |
MAX() with OVER() returns the maximum sum of orders found in the dataset, providing a reference for the highest order sum among all customers.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
MAX(SUM(ol.price)) OVER () AS "Max. Sum of Orders"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC
LIMIT 5;
| Sum of Orders per Customer | Full Name | Max. Sum of Orders |
|---|---|---|
| 711.41 | Rich Kirsz | 711.41 |
| 663.41 | Farand Tremmil | 711.41 |
| 635.58 | Renado Sherrington | 711.41 |
| 622.31 | Lynda Roseborough | 711.41 |
| 611.59 | Mick Sever | 711.41 |
MIN() with OVER() returns the minimum sum of orders found in the dataset, providing a reference for the highest order sum among all customers.
SELECT
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
MIN(SUM(ol.price)) OVER () AS "Min. Sum of Orders"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
GROUP BY
c.first_name, c.last_name
ORDER BY
"Sum of Orders per Customer" DESC
LIMIT 5;
| Sum of Orders per Customer | Full Name | Min. Sum of Orders |
|---|---|---|
| 711.41 | Rich Kirsz | 0.25 |
| 663.41 | Farand Tremmil | 0.25 |
| 635.58 | Renado Sherrington | 0.25 |
| 622.31 | Lynda Roseborough | 0.25 |
| 611.59 | Mick Sever | 0.25 |
A partition is a subset of rows in the result set. When you use a window function with the PARTITION BY clause, the window function is applied to each partition separately. This is useful for performing calculations within each group of rows.
SELECT
a.city,
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
RANK() OVER (PARTITION BY a.city ORDER BY SUM(ol.price) DESC) AS "Rank"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
GROUP BY
a.city, c.first_name, c.last_name
ORDER BY
a.city, "Rank";
| city | Sum of Orders per Customer | Full Name | Rank |
|---|---|---|---|
| A Yun Pa | 123.08 | Joelie Menhci | 1 |
| A Yun Pa | 85.49 | Mariette Tulley | 2 |
| A Yun Pa | 22.89 | Cristen Tooby | 3 |
| Acacías | 58.79 | Pierette Duplain | 1 |
| Acacías | 55.95 | Grantley Clyburn | 2 |
| Acacías | 46.78 | Bennie Trigg | 3 |
| Afántou | 115.89 | Constance Raeburn | 1 |
| Afántou | 33.64 | Linea Gaither | 2 |
| Aghavnadzor | 171.43 | York Londsdale | 1 |
| Aghavnadzor | 35.28 | Patsy Hicks | 2 |
CUME_DIST() calculates the cumulative distribution of a value within a partition. It returns the relative position of a value within the partition as a number between 0 and 1. It shows the proportion of rows that have a value less than or equal to the current row’s value.
SELECT
a.city,
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
CUME_DIST() OVER (PARTITION BY a.city ORDER BY SUM(ol.price) DESC) AS "Cumulative Distribution"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
WHERE
a.city = 'A Yun Pa' -- Filter for specific city here
GROUP BY
a.city, c.first_name, c.last_name
ORDER BY
a.city, "Sum of Orders per Customer" DESC;
| city | Sum of Orders per Customer | Full Name | Cumulative Distribution |
|---|---|---|---|
| A Yun Pa | 123.08 | Joelie Menhci | 0.3333333 |
| A Yun Pa | 85.49 | Mariette Tulley | 0.6666667 |
| A Yun Pa | 22.89 | Cristen Tooby | 1.0000000 |
The above result means 100% of customers have a sum of orders less than or equal to Cristen’s, 2/3 less than or equal to Mariettes’s, and 1/3% less than or equal to Joelie’s.
The WIDTH_BUCKET function divides a continuous range of values into a specified number of equal-width buckets, returning the bucket number for each value. It is commonly used to categorize numeric data into intervals for analysis.
For example -
Below one can see that the salary column has been divided 10 times, from the values 0 to 150,000.
knitr::include_graphics("C:/Users/gam55/Downloads/width_bucket_screenshot.png")
SQLite does not support the WIDTH_BUCKET function directly but it can be simulated using a series of CASE statements.
SELECT
a.city,
SUM(ol.price) AS "Sum of Orders per Customer",
CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
CASE
WHEN SUM(ol.price) < 100 THEN 3
WHEN SUM(ol.price) BETWEEN 100 AND 200 THEN 2
WHEN SUM(ol.price) BETWEEN 201 AND 300 THEN 1
ELSE 4
END AS "Bucket"
FROM
order_line ol
INNER JOIN
cust_order co ON ol.order_id = co.order_id
INNER JOIN
customer c ON co.customer_id = c.customer_id
INNER JOIN
address a ON co.dest_address_id = a.address_id
GROUP BY
a.city, c.first_name, c.last_name
ORDER BY
a.city, "Sum of Orders per Customer" DESC;
| city | Sum of Orders per Customer | Full Name | Bucket |
|---|---|---|---|
| A Yun Pa | 123.08 | Joelie Menhci | 2 |
| A Yun Pa | 85.49 | Mariette Tulley | 3 |
| A Yun Pa | 22.89 | Cristen Tooby | 3 |
| Acacías | 58.79 | Pierette Duplain | 3 |
| Acacías | 55.95 | Grantley Clyburn | 3 |
| Acacías | 46.78 | Bennie Trigg | 3 |
| Afántou | 115.89 | Constance Raeburn | 2 |
| Afántou | 33.64 | Linea Gaither | 3 |
| Aghavnadzor | 171.43 | York Londsdale | 2 |
| Aghavnadzor | 35.28 | Patsy Hicks | 3 |
SQLite supports the WITH RECURSIVE clause, which allows you to write recursive CTEs. This feature is useful for querying hierarchical or recursive data, such as workplace hierarchy or family trees.
Some of the explanations below have been lifted from https://learnsql.com/blog/sql-recursive-cte/
Recap on ‘common table expressions’, CTEs
“The CTE (common table expression), also known as the WITH clause, is an SQL feature that returns a temporary data set that can be used by another query. As it’s a temporary result, it’s not stored anywhere, but it still can be referenced like you would reference any other table.”
“A recursive CTE references itself. It returns the result subset, then it repeatedly (recursively) references itself, and stops when it returns all the results.”
The general syntax of a non-recursive CTE -
WITH cte_name AS (cte_query_definition)
SELECT *
FROM cte_name;
“The first part of the syntax is the CTE. It begins with the keyword WITH. Then you give your CTE a name. After you follow that by the AS keyword, you can define CTE in the parentheses.”
“The second part of the syntax is a simple SELECT statement. It is written immediately after the recursive CTE, without any commas, semicolons, or similar marks. The CTE is used in another query just like any other table. This is exactly what the SELECT statement does.”
Recursive CTEs
A recursive CTE references itself. It returns the result subset, then it repeatedly (recursively) references itself, and stops when it returns all the results.
The syntax for a recursive CTE is not too different from that of a non-recursive CTE -
WITH RECURSIVE cte_name AS (
cte_query_definition (the anchor member)
UNION ALL
cte_query_definition (the recursive member)
)
SELECT *
FROM cte_name;
“Again, at the beginning of your CTE is the WITH clause. However, if you want your CTE to be recursive, then after WITH you write the RECURSIVE keyword. Then it’s business as usual: AS is followed by the parentheses with the CTE query definition. This first query definition is called the anchor member.”
“To connect the anchor member with the recursive member, you need to use the UNION or UNION ALL command. The recursive member is, obviously, the recursive part of CTE that will reference the CTE itself.”
Finding Hierarchy in a Workplace Example
The ‘Gravity Bookstore’ used so far in this document doesn’t have any kind of hierarchical data some needs to be created -
CREATE TABLE employee (
employee_id INTEGER PRIMARY KEY,
name TEXT,
position TEXT,
manager_id INTEGER
);
INSERT INTO employee (employee_id, name, position, manager_id) VALUES
(1, 'Alice', 'CEO', NULL), -- Top-level management
(2, 'Bob', 'VP of Sales', 1),
(3, 'Carol', 'VP of Engineering', 1),
(4, 'Dave', 'Sales Manager', 2),
(5, 'Eve', 'Sales Executive', 4),
(6, 'Frank', 'Sales Executive', 4),
(7, 'Grace', 'Engineering Manager', 3),
(8, 'Heidi', 'Senior Engineer', 7),
(9, 'Ivan', 'Junior Engineer', 7),
(10, 'Judy', 'CTO', 1),
(11, 'Mallory', 'Sales Associate', 5),
(12, 'Oscar', 'Sales Associate', 5),
(13, 'Peggy', 'Engineering Intern', 9),
(14, 'Sybil', 'Product Manager', 3),
(15, 'Trent', 'DevOps Engineer', 7),
(16, 'Victor', 'Lead Architect', 10),
(17, 'Walter', 'Data Scientist', 16),
(18, 'Xavier', 'ML Engineer', 17),
(19, 'Yvonne', 'UX Designer', 14),
(20, 'Zara', 'Graphic Designer', 19);
select * from employee
| employee_id | name | position | manager_id |
|---|---|---|---|
| 1 | Alice | CEO | NA |
| 2 | Bob | VP of Sales | 1 |
| 3 | Carol | VP of Engineering | 1 |
| 4 | Dave | Sales Manager | 2 |
| 5 | Eve | Sales Executive | 4 |
| 6 | Frank | Sales Executive | 4 |
| 7 | Grace | Engineering Manager | 3 |
| 8 | Heidi | Senior Engineer | 7 |
| 9 | Ivan | Junior Engineer | 7 |
| 10 | Judy | CTO | 1 |
WITH RECURSIVE company_hierarchy AS (
SELECT
employee_id,
name,
position,
manager_id,
0 AS hierarchy_level
FROM
employee
WHERE
manager_id IS NULL
UNION ALL
SELECT
e.employee_id,
e.name,
e.position,
e.manager_id,
ch.hierarchy_level + 1
FROM
employee e,
company_hierarchy ch
WHERE
e.manager_id = ch.employee_id
)
SELECT
ch.name AS "Employee name",
e.name AS "Boss name",
ch.position AS "Employee Position",
e.position AS "Boss Position",
ch.hierarchy_level AS "Hierarchy Level"
FROM
company_hierarchy ch
LEFT JOIN
employee e ON ch.manager_id = e.employee_id
ORDER BY
ch.hierarchy_level,
ch.manager_id;
| Employee name | Boss name | Employee Position | Boss Position | Hierarchy Level |
|---|---|---|---|---|
| Alice | NA | CEO | NA | 0 |
| Bob | Alice | VP of Sales | CEO | 1 |
| Carol | Alice | VP of Engineering | CEO | 1 |
| Judy | Alice | CTO | CEO | 1 |
| Dave | Bob | Sales Manager | VP of Sales | 2 |
| Grace | Carol | Engineering Manager | VP of Engineering | 2 |
| Sybil | Carol | Product Manager | VP of Engineering | 2 |
| Victor | Judy | Lead Architect | CTO | 2 |
| Eve | Dave | Sales Executive | Sales Manager | 3 |
| Frank | Dave | Sales Executive | Sales Manager | 3 |
The above is a recursive query, so it starts with WITH RECURSIVE. The name of the CTE is company_hierarchy. After AS, the CTE definition is in the parentheses.
The first SELECT statement, the anchoring statement, selects all the employee table columns where the column manager_id is NULL. In short, it will select Alice, because only she has a NULL value in that column - starting the recursion from the top of the organizational structure. There’s also a column hierarchy_level with the value of 0. That means the head of the company’s level is 0 – they’re on top of the hierarchy.
UNION ALL is used to connect this SELECT statement with the second one, the recursive member. In the recursive member, all the columns from the table employees ares selected and the CTE company_hierarchy where the column boss_id is equal to the column id. Notice the part hierarchy_level + 1.This means that with every recursion, the CTE will add 1 to the previous hierarchy level, and it will do that until it reaches the end of the hierarchy. Also note that the CTE is treated as any other table. To finish defining the CTE, the brackets are closed.
Finally, there’s a third SELECT statement, outside of the CTE. It selects the columns that will show employees, their bosses’ names, and the hierarchy level. Data is taken from the CTE and the table employees. The CTE and employees table are joined with a LEFT JOIN, since we want all the data from the CTE – including Alice, who has the NULL value in the column boss_id.
In a very simplistic way -
“Recursive CTEs are used to handle and query data that refers to itself, like finding all employees who report to a manager in a company hierarchy.”